An in-depth walkthrough of the essential metrics used to evaluate machine learning model performance, with mathematical formulas and Python examples.
The cornerstone of all classification metrics: a table that summarizes a model's correct and incorrect predictions.
A Confusion Matrix is a table used to visualize the performance of a classification model. For binary classification it is a 2×2 matrix with four key components:
| | Predicted Positive (+) | Predicted Negative (−) |
|---|---|---|
| **Actual Positive (+)** | TP (True Positive) | FN (False Negative) |
| **Actual Negative (−)** | FP (False Positive) | TN (True Negative) |
**True Positive (TP):** Actually positive and correctly predicted as positive. E.g., a sick patient correctly diagnosed as sick.
**False Positive (FP):** Actually negative but incorrectly predicted as positive. Also known as a Type I Error. E.g., a healthy person flagged as sick.
**False Negative (FN):** Actually positive but incorrectly predicted as negative. Also known as a Type II Error. E.g., a sick patient missed by the test.
**True Negative (TN):** Actually negative and correctly predicted as negative. E.g., a healthy person confirmed as healthy.
Consider an email spam filter classifying 1,000 emails:
TP = 80 (spam correctly caught) · FP = 10 (legitimate emails marked as spam)
FN = 20 (spam emails missed) · TN = 890 (legitimate emails correctly passed)
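These four counts are enough to derive every metric discussed below. A minimal sketch that recomputes them directly from the example (the variable names are ours):

```python
# Spam-filter example: 1,000 emails, counts from the confusion matrix above.
TP, FP, FN, TN = 80, 10, 20, 890

accuracy    = (TP + TN) / (TP + TN + FP + FN)        # fraction correct overall
precision   = TP / (TP + FP)                         # how trustworthy "spam" flags are
recall      = TP / (TP + FN)                         # how much actual spam was caught
specificity = TN / (TN + FP)                         # how much legitimate mail passed
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R

print(f"Accuracy:    {accuracy:.3f}")    # 0.970
print(f"Precision:   {precision:.3f}")   # 0.889
print(f"Recall:      {recall:.3f}")      # 0.800
print(f"Specificity: {specificity:.3f}") # 0.989
print(f"F1 Score:    {f1:.3f}")          # 0.842
```

These values match the worked calculations in each section that follows.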
The most intuitive metric, but one that can be misleading when classes are imbalanced.
Accuracy is the ratio of correct predictions (both positive and negative) to the total number of predictions: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Spam filter: $\text{Accuracy} = \frac{80 + 890}{80 + 890 + 10 + 20} = \frac{970}{1000} = \mathbf{0.97}$ → 97%
Accuracy can be misleading with class imbalance. For example, if only 10 out of 1,000 patients have cancer, a model that never predicts cancer still achieves 99% accuracy, yet fails to detect a single case! Use Precision, Recall, and F1 Score for imbalanced data.
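The cancer-screening pitfall can be reproduced in a few lines (a sketch with made-up labels matching the example's proportions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 patients, only 10 with cancer (label 1); the "model" never predicts cancer.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.99, looks excellent
print("Recall  :", recall_score(y_true, y_pred))    # 0.0, detects no cases at all
```

A 99% accuracy and a 0% recall describe the same useless model.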
Measures how trustworthy the model's positive predictions are.
Precision is the fraction of positive predictions that are actually correct: $\text{Precision} = \frac{TP}{TP + FP}$. It answers: "Of everything I labeled positive, how much was truly positive?"
Spam filter: $\text{Precision} = \frac{80}{80 + 10} = \frac{80}{90} = \mathbf{0.889}$ → 88.9%
Out of 90 emails flagged as spam, 80 were actually spam.
Focus on Precision when false positives are costly:
• Spam Filter: Blocking an important email is risky
• Search Engine: Irrelevant results degrade user experience
• Recommendation System: Wrong recommendations erode trust
Measures how many of the actual positive cases the model successfully captures.
Recall (also called Sensitivity or True Positive Rate) is the fraction of actual positives that the model correctly identified: $\text{Recall} = \frac{TP}{TP + FN}$. It answers: "Of all actual positives, how many did I catch?"
Spam filter: $\text{Recall} = \frac{80}{80 + 20} = \frac{80}{100} = \mathbf{0.80}$ → 80%
Out of 100 actual spam emails, our model caught 80 and missed 20.
Focus on Recall when false negatives are costly:
• Cancer Screening: Missing a patient can be life-threatening
• Fraud Detection: Missed fraud leads to major financial loss
• Security Systems: Undetected threats pose serious risks
Precision and Recall typically move in opposite directions. Lowering the decision threshold captures more positives (higher Recall) but also produces more false positives (lower Precision). The F1 Score balances this trade-off.
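The trade-off is easy to see by sweeping the threshold over a small set of predicted probabilities (the labels and scores below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # made-up ground truth
y_prob = np.array([0.9, 0.4, 0.8, 0.35, 0.1,
                   0.7, 0.6, 0.2, 0.55, 0.3])        # made-up model scores

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)       # classify by threshold
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this data, raising the threshold from 0.3 to 0.7 lifts Precision from 0.625 to 1.000 while Recall falls from 1.000 to 0.600.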
A single number that balances Precision and Recall using the harmonic mean.
The F1 Score is the harmonic mean of Precision and Recall: $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$. The harmonic mean is used instead of the arithmetic mean because it penalizes extreme differences between the two values more heavily.
Spam filter: $F_1 = 2 \times \frac{0.889 \times 0.80}{0.889 + 0.80} = 2 \times \frac{0.711}{1.689} = \mathbf{0.842}$ → 84.2%
When you want to weight Precision or Recall differently, use the F-Beta Score:
| Score | β Value | Weight | Use Case |
|---|---|---|---|
| F0.5 | 0.5 | Favors Precision | Spam filters, search engines |
| F1 | 1.0 | Equal weight | General classification problems |
| F2 | 2.0 | Favors Recall | Medical diagnosis, security systems |
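Using the spam filter's Precision (80/90) and Recall (80/100), the general formula $F_\beta = (1 + \beta^2) \frac{P \times R}{\beta^2 P + R}$ shows how β shifts the balance (a small sketch; the helper function is ours):

```python
# Spam-filter values: Precision = 80/90 ≈ 0.889, Recall = 80/100 = 0.80.
P, R = 80 / 90, 80 / 100

def f_beta(p, r, beta):
    """General F-beta score: beta > 1 weights recall, beta < 1 weights precision."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g}: {f_beta(P, R, beta):.3f}")
# Since Precision > Recall here, the scores order as F0.5 > F1 > F2.
```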
Measures how well the model identifies true negatives.
Specificity (True Negative Rate) is the proportion of actual negatives correctly identified by the model: $\text{Specificity} = \frac{TN}{TN + FP}$. It is the counterpart of Recall for the negative class.
Spam filter: $\text{Specificity} = \frac{890}{890 + 10} = \frac{890}{900} = \mathbf{0.989}$ → 98.9%
98.9% of legitimate emails were correctly identified.
Sensitivity (Recall): How well do we detect the sick?
Specificity: How well do we identify the healthy?
Together they form the basis of the ROC curve.
A powerful visualization and comparison tool that shows model performance across all threshold values.
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (Recall) against the False Positive Rate (1 β Specificity) at various classification thresholds.
AUC is the area under the ROC curve. It ranges from 0 to 1 and measures the model's overall ability to distinguish between classes.
| AUC Value | Rating | Interpretation |
|---|---|---|
| 1.0 | Perfect | Model perfectly separates all classes |
| 0.9 – 1.0 | Excellent | High discriminative power |
| 0.7 – 0.9 | Good | Acceptable performance |
| 0.5 – 0.7 | Poor | Slightly better than random |
| 0.5 | Random | Equivalent to a coin flip |
AUC is threshold-independent and ideal for comparing models. It is more reliable than Accuracy for imbalanced datasets.
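A quick AUC computation on hypothetical scores (labels and scores are invented; `roc_curve` returns the FPR/TPR pairs that trace the curve):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # made-up labels
y_score = np.array([0.9, 0.4, 0.8, 0.35, 0.1,
                    0.7, 0.6, 0.2, 0.55, 0.3])        # made-up scores

print("AUC:", roc_auc_score(y_true, y_score))         # 0.88 for this data

# The points of the ROC curve itself, one per threshold:
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```

An AUC of 0.88 means that for 88% of (positive, negative) pairs, the positive sample receives the higher score.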
Evaluates the quality of predicted probabilities: not just right or wrong, but how confident the model is.
Log Loss (Binary Cross-Entropy) measures how well the model's predicted probabilities match the true labels. Lower Log Loss = better model.
$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$
Where $y_i$ is the actual label (0 or 1), $\hat{y}_i$ is the predicted probability, and $N$ is the total number of samples.
• When the model's confidence level matters, not just the class label
• When performing probability calibration
• Frequently used as a scoring metric in Kaggle competitions
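The formula can be checked directly against scikit-learn's implementation (the probabilities below are invented):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])            # made-up labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1])  # made-up predicted probabilities

# Binary cross-entropy computed directly from the formula.
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print("Manual :", manual)
print("sklearn:", log_loss(y_true, y_prob))   # matches up to numerical clipping
```

Note the confident wrong-ish prediction (true label 1, probability 0.4) contributes far more loss than the confident correct ones.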
Error and goodness-of-fit measures for models that predict continuous values.
**MAE (Mean Absolute Error):** The average of absolute differences between predicted and actual values, $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$. More robust to outliers than MSE.
**MSE (Mean Squared Error):** The average of squared differences between predicted and actual values, $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. Penalizes larger errors more heavily.
**RMSE (Root Mean Squared Error):** The square root of MSE, $\text{RMSE} = \sqrt{\text{MSE}}$, bringing the error back to the original scale. One of the most widely used regression metrics.
**R² (Coefficient of Determination):** Indicates how much of the variance in the dependent variable is explained by the model, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. Closer to 1 = better fit.
| R² Value | Interpretation |
|---|---|
| 1.0 | Perfect fit: model explains all variance |
| 0.7 – 1.0 | Good fit |
| 0.4 – 0.7 | Moderate fit |
| < 0.4 | Weak fit: model doesn't explain data well |
| < 0 | Model is worse than predicting the mean |
Selecting the right metric is as important as selecting the right model.
| Scenario | Recommended Metric | Why? |
|---|---|---|
| Balanced dataset | Accuracy, F1 | Accuracy is reliable when classes are balanced |
| Imbalanced dataset | F1, AUC, Precision/Recall | Accuracy can be misleading |
| False positives are costly | Precision | Minimize FP |
| False negatives are costly | Recall | Minimize FN |
| Probability estimates matter | Log Loss, AUC | Evaluates model confidence |
| Model comparison | AUC | Threshold-independent comparison |
| Continuous value prediction | RMSE, MAE, R² | Designed for regression problems |
Computing all metrics with scikit-learn.
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report,
    roc_auc_score, log_loss
)
import numpy as np

# Ground truth and predicted labels
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Compute the core metrics
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_true, y_pred))

# Detailed per-class report
print("\nClassification Report:")
print(classification_report(y_true, y_pred))
```
```python
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)
import numpy as np

# Ground truth and predicted continuous values
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.2, 2.1, 6.8, 4.9])

print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R²  :", r2_score(y_true, y_pred))
```